2. Feature Engineering

Python
Machine Learning
Data Visualization
Published

September 21, 2025

Brief Look at the Dataset

First we need to load our dataframe from the CSV file we created in part 1. List-valued columns are stored in the CSV as strings, so we parse them back into Python objects. Then, let's take a look at all the columns in the dataset.

import pandas as pd
import ast

df = pd.read_csv('data/first_gen_pokemon_cards.csv')

columns_to_parse = ['weaknesses', 'resistances', 'subtypes', 'types', 'abilities', 'attacks', 'nationalPokedexNumbers', 'evolvesTo', 'rules']
for col in columns_to_parse:
    if col in df.columns:
        df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x != 'nan' and pd.notna(x) else ([] if col != 'nationalPokedexNumbers' else None))

print(df.columns)
Index(['id', 'name', 'supertype', 'subtypes', 'level', 'hp', 'types',
       'evolvesFrom', 'abilities', 'attacks', 'weaknesses', 'retreatCost',
       'convertedRetreatCost', 'number', 'artist', 'rarity', 'flavorText',
       'nationalPokedexNumbers', 'legalities', 'images', 'evolvesTo',
       'resistances', 'rules', 'regulationMark', 'ancientTrait'],
      dtype='object')
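To see why the parsing loop is needed: a CSV file has no list type, so a list-valued cell comes back from read_csv as its string representation. The snippet below is a minimal sketch using made-up cell values (they are illustrative, not rows copied from the dataset) showing how ast.literal_eval turns such strings back into lists and dicts.

```python
import ast

# A CSV cell that held a list is read back as a plain string.
raw_types = "['Psychic']"
raw_weaknesses = "[{'type': 'Fighting', 'value': '×2'}]"

# ast.literal_eval only evaluates Python literals (lists, dicts, strings,
# numbers), so it does not run arbitrary code the way eval() would.
types = ast.literal_eval(raw_types)
weaknesses = ast.literal_eval(raw_weaknesses)

print(type(raw_types).__name__, "->", type(types).__name__)
print(weaknesses[0]["type"], weaknesses[0]["value"])
```

After parsing, the values behave like normal Python objects, which is what the later feature extraction steps rely on.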

I took these columns and created a simple data dictionary for reference:

Column Name | Data Type | Description | Allowed Values | Examples | Missing Values
id | String | Unique identifier for each card | Alphanumeric strings | “xy7-54”, “sm3-22” | No
name | String | Name of the Pokemon card | Alphanumeric strings | “Pikachu”, “Charizard” | No
supertype | String | Broad category of the card | “Pokémon” | “Pokémon” | No
subtypes | Array of strings | More specific categories within the supertype | Array of strings | [“Basic”, “Stage 1”, “Stage 2”, “EX”, “Team Plasma”…] | No
level | String | Level of the Pokémon (if applicable) | Alphanumeric strings or X | “12”, “45”, “X” | Yes
hp | Integer | Hit points of the Pokémon | Positive integers | 60, 120, 200 | No
types | Array of strings | Types of the Pokémon | [“Fire”, “Water”, “Grass”, “Electric”, “Psychic”, “Fighting”, “Darkness”, “Metal”, “Fairy”, “Dragon”, “Colorless”] | [“Fire”], [“Water”, “Flying”] | No
evolvesFrom | String | Name of the Pokémon this card evolves from (if applicable) | Alphanumeric strings | “Pikachu”, “Charmander” | Yes
abilities | Array of objects | Special abilities of the Pokémon | Objects with name, text, and type fields | [{name: “Static”, text: “May paralyze opponent’s Pokémon”, type: “Poké-Body”}] | Yes
attacks | Array of objects | Attacks that the Pokémon can perform | Objects with name, cost, convertedEnergyCost, damage, and text fields | [{name: “Thunder Shock”, cost: [“Electric”, “Colorless”], convertedEnergyCost: 2, damage: “30”, text: “May paralyze opponent’s Pokémon”}] | Yes
weaknesses | Array of objects | Weaknesses of the Pokémon | Objects with type and value fields | [{type: “Fighting”, value: “×2”}] | Yes
retreatCost | Array of strings | Energy types required to retreat the Pokémon | [“Colorless”] | [“Colorless”, “Colorless”] | Yes
convertedRetreatCost | Integer | Total number of energy required to retreat the Pokémon | Non-negative integers | 1, 2, 3 | Yes
number | String | Card number within its set | Alphanumeric strings | “54”, “22” | No
artist | String | Name of the card’s illustrator | Alphanumeric strings | “Mitsuhiro Arita”, “5ban Graphics” | Yes
rarity | String | Rarity level of the card | “Common”, “Uncommon”, “Rare”, “Holo Rare”, “Ultra Rare”, “Secret Rare”, etc. | “Common”, “Holo Rare” | Yes
flavorText | String | Flavor text providing background or lore about the Pokémon | Alphanumeric strings | “When several of these Pokémon gather, their electricity could build and cause lightning storms.” | Yes
nationalPokedexNumbers | Array of integers | National Pokédex numbers associated with the Pokémon | Positive integers | [25], [6] | No
legalities | Object | Legality of the card in various formats | Fields for “expanded”, “standard”, “unlimited” with values “Legal” or “Not Legal” | {expanded: “Legal”, standard: “Not Legal”, unlimited: “Legal”} | No
images | Object | URLs for the card’s images | Fields for “small” and “large” with URL strings | {small: “http://…”, large: “http://…”} | No
evolvesTo | Array of strings | Names of Pokémon this card can evolve into (if applicable) | Alphanumeric strings | [“Raichu”, “Pikachu Libre”] | Yes
resistances | Array of objects | Resistances of the Pokémon | Objects with type and value fields | [{type: “Metal”, value: “-20”}] | Yes
rules | Array of strings | Special rules that apply to the card | Alphanumeric strings | [“If this Pokémon is your Active Pokémon, your opponent’s attacks do 20 less damage (before applying Weakness and Resistance).”] | Yes
regulationMark | String | Regulation mark for tournament legality | Single uppercase letters | “D”, “E” | Yes
ancientTrait | Object | Ancient Trait of the Pokémon (if applicable) | Object with name and text fields | {name: “Delta Evolution”, text: “This Pokémon can evolve from any type of basic Pokémon.”} | Yes

We can see that quite a few features are not necessary. The obvious ones are id and images, since these are unique identifiers and URLs, so we can drop them from the dataframe. Now we can focus on the features that would help a model learn the game mechanics that determine the hit points of a pokemon card. Given this goal, we can also drop the legalities and regulationMark columns, since they pertain to the rules of the card game and not to the pokemon card itself. Finally, we can drop the supertype column, since every card in our dataset has the same supertype, Pokémon.
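A quick way to back up this kind of decision is to count distinct values per column. The sketch below uses a small made-up frame (the values are illustrative, and the names constant_cols and identifier_cols are my own) to flag columns with a single value, like supertype, and columns with one value per row, like id.

```python
import pandas as pd

# Toy frame standing in for the card data (illustrative values only).
toy = pd.DataFrame({
    "id": ["base1-1", "base1-2", "base1-3"],
    "supertype": ["Pokémon", "Pokémon", "Pokémon"],
    "hp": [80, 100, 80],
})

n_unique = toy.nunique()

# A column with one distinct value cannot help a model tell cards apart.
constant_cols = n_unique[n_unique == 1].index.tolist()

# A column with a distinct value on every row behaves like an identifier.
identifier_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols)
print(identifier_cols)
```

The identifier check is only a hint: a genuinely useful numeric column can also happen to be unique on every row, so it is worth reading the flagged names before dropping anything.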

Some of the remaining columns are probably not useful for predicting the hit points of a pokemon card either, but that is hard to tell without running some analysis.

Feature Engineering

I looked at all the columns in the dataset and decided on the following feature engineering steps:

Column Name Feature Engineering Steps
id We will drop this column since it is a unique identifier and does not provide any useful information for predicting hit points.
images We will drop this column since it contains URLs to images and does not provide any useful information for predicting hit points.
legalities We will drop this column since it pertains to the card game rules and not the pokemon card itself.
regulationMark We will drop this column since it pertains to the card game rules and not the pokemon card itself.
supertype We will drop this column since all of the cards in our dataset are of the same supertype Pokémon.
hp This is our target variable that we are trying to predict. We don’t need to do any feature engineering on this column.
level Most of the values in this column are missing, but we can fill in the missing values with the median level of the pokemon and create a new feature indicating whether the level was missing or not.
nationalPokedexNumbers We will convert this to a numerical value by taking the first number in the array.
convertedRetreatCost This is already a numerical value and can be used as is. We will fill in any missing values with 0.
rarity We can use one-hot encoding to convert this categorical feature into multiple binary features.
evolvesFrom We can create a new binary feature indicating whether the pokemon evolves from another pokemon or not. 0 for no and 1 for yes.
evolvesTo This can be handled the same way as evolvesFrom: we can create a new binary feature indicating whether the pokemon evolves to another pokemon or not. 0 for no and 1 for yes.
subtypes We can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
types Similar to subtypes, we can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features
weaknesses We can extract three features from this column:
- weakness_types: We can extract the types from the weaknesses and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
- total_weakness_multiplier: We can extract the multiplier values from the weaknesses (e.g. “×2” -> 2) and sum them up to create a new numerical feature.
- total_weakness_modifier: We can extract the modifier values from the weaknesses (e.g. “+20” -> 20) and sum them up to create a new numerical feature.
resistances Similar to weaknesses, we can extract three features from this column:
- resistance_types: We can extract the types from the resistances and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
- total_resistance_modifier: We can extract the modifier values from the resistances (e.g. “-20” -> -20) and sum them up to create a new numerical feature.
- total_resistance_multiplier: We can extract the multiplier values from the resistances and sum them up to create a new numerical feature.
retreatCost Since convertedRetreatCost already captures the length of this list, we can drop this column.
name This feature has a very high cardinality. Originally my idea was to count the number of times each name appears in the dataset and use that as a feature. However, we can already count this using the nationalPokedexNumbers feature since each pokemon name corresponds to a unique pokedex number. Therefore, we can drop this feature.
artist Similar to name, we can count the number of times each artist appears in the dataset and use that as a feature.
abilities I will split this into three features:
- ability_count: The number of abilities the pokemon has.
- ability_text: The combined text of all abilities.
- has_pokemon_power: A binary feature indicating whether the pokemon has a Poké-Body or Poké-Power ability.
attacks Similar to abilities, I will split this into three features:
- attack_count: The number of attacks the pokemon has.
- max_damage: The highest damage value among all attacks.
- attack_cost: The total converted energy cost of all attacks.
rules I will create a binary feature indicating whether the pokemon has any special rules or not.
ancientTrait I will create a binary feature indicating whether the pokemon has an ancient trait or not.
flavorText I believe the flavor text does not provide any information that could help us predict the HP of a pokemon card, but let's use TfidfVectorizer and run an analysis on it to see.
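As a rough idea of what the TfidfVectorizer step for flavorText could look like, here is a minimal sketch. The three texts are invented for illustration, and the stop_words and max_features settings are assumptions on my part rather than choices taken from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented flavor texts (not taken from the dataset).
flavor_texts = [
    "A calm pokemon that sleeps for most of the day.",
    "It stores electricity in its cheeks.",
    "A calm pokemon that lives near the water.",
]

# Assumed settings: drop common English words, cap the vocabulary size.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50)
tfidf = vectorizer.fit_transform(flavor_texts)

# One row per text, one column per term kept in the vocabulary.
print(tfidf.shape)
print(vectorizer.get_feature_names_out())
```

Each column of the resulting matrix can then be compared against hp to check whether any term carries signal.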

Let's also take a brief look at the data in our dataframe before we proceed with the feature engineering steps.

print(df.shape)
df.head(3)
(4470, 25)
id name supertype subtypes level hp types evolvesFrom abilities attacks ... rarity flavorText nationalPokedexNumbers legalities images evolvesTo resistances rules regulationMark ancientTrait
0 base1-1 Alakazam Pokémon [Stage 2] 42 80 [Psychic] Kadabra [{'name': 'Damage Swap', 'text': 'As often as ... [{'name': 'Confuse Ray', 'cost': ['Psychic', '... ... Rare Holo Its brain can outperform a supercomputer. Its ... [65] {'unlimited': 'Legal'} {'small': 'https://images.pokemontcg.io/base1/... [] [] [] NaN NaN
1 base1-2 Blastoise Pokémon [Stage 2] 52 100 [Water] Wartortle [{'name': 'Rain Dance', 'text': 'As often as y... [{'name': 'Hydro Pump', 'cost': ['Water', 'Wat... ... Rare Holo A brutal Pokémon with pressurized water jets o... [9] {'unlimited': 'Legal'} {'small': 'https://images.pokemontcg.io/base1/... [] [] [] NaN NaN
2 base1-3 Chansey Pokémon [Basic] 55 120 [Colorless] NaN [] [{'name': 'Scrunch', 'cost': ['Colorless', 'Co... ... Rare Holo A rare and elusive Pokémon that is said to bri... [113] {'unlimited': 'Legal'} {'small': 'https://images.pokemontcg.io/base1/... [Blissey] [{'type': 'Psychic', 'value': '-30'}] [] NaN NaN

3 rows × 25 columns

Cleaning the Data

In this section we will focus on dropping columns and extracting features from our initial list of features. We will then transform and scale them in the next section. Let's first drop the columns that we decided aren’t useful from the dataframe:

  • id
  • images
  • legalities
  • regulationMark
  • supertype
  • retreatCost
  • name
df.drop(columns=['id', 'images', 'legalities', 'regulationMark', 'supertype', 'retreatCost', 'name'], inplace=True)

Direct Numerical Features

We can start with the columns that are already numerical values. These columns are:

  • level: I replace the X found in some levels with 100, which is the highest level a pokemon can reach in the games. Missing values will be filled in later.
df['level'] = df['level'].apply(
  lambda x: int(x.replace('X', '100')) if isinstance(x, str) and x != 'nan' and pd.notna(x) else None
)
  • level_was_missing: A binary feature indicating whether the level was missing or not.
df['level_was_missing'] = df['level'].isnull().astype(int)
  • nationalPokedexNumbers: We will convert this to a numerical value by taking the first number in the array.
df['primary_pokedex_number'] = df['nationalPokedexNumbers'].apply(
    lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None
)
  • pokemon_count: Counts how many Pokemon are in the nationalPokedexNumbers array.
df['pokemon_count'] = df['nationalPokedexNumbers'].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)
  • convertedRetreatCost: This is already a numerical value and can be used as is. We just need to fill in any missing values with 0.
df['convertedRetreatCost'] = df['convertedRetreatCost'].fillna(0)
df['convertedRetreatCost'] = df['convertedRetreatCost'].replace('.', 0).astype(int)
  • number: We will convert this to a numerical value by taking the first run of digits in the card number. For example, “54a” would be converted to 54.
import re

df['number'] = df['number'].apply(
    lambda x: int(re.search(r'\d+', str(x)).group()) if pd.notna(x) and re.search(r'\d+', str(x)) else None
)
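The order of the level steps matters: the missing-value flag has to be computed before any imputation, otherwise the information about which rows were missing is lost. The sketch below walks through that on a made-up column. The median fill at the end is only one possible version of the later fill step, shown here as an assumption, and pd.to_numeric is added just to make the dtype explicit.

```python
import pandas as pd

# Toy level column: two numbers, one "X", one missing value.
toy = pd.DataFrame({"level": ["12", "X", None, "45"]})

# Same rule as above: "X" becomes 100, missing values stay missing.
toy["level"] = pd.to_numeric(
    toy["level"].apply(
        lambda x: int(x.replace("X", "100")) if isinstance(x, str) else None
    )
)

# Record missingness BEFORE imputing; afterwards it cannot be recovered.
toy["level_was_missing"] = toy["level"].isnull().astype(int)

# Assumed later step: fill the gaps with the median of the known levels.
toy["level"] = toy["level"].fillna(toy["level"].median())

print(toy)
```

With the flag in place, a model can still distinguish a card that really has the median level from one whose level was never printed.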

Simple Categorical Features

Next we can look at the simple categorical features that have a limited number of unique values:

  • rarity: We can use one-hot encoding to convert this categorical feature into multiple binary features.
from sklearn.preprocessing import OneHotEncoder

hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
df['rarity'] = df['rarity'].fillna('Unknown')
rarity_encoded = hot_encoder.fit_transform(df[['rarity']])
rarity_encoded_df = pd.DataFrame(rarity_encoded, columns=hot_encoder.get_feature_names_out(['rarity']))
df = pd.concat([df, rarity_encoded_df], axis=1)
df.drop(columns=['rarity'], inplace=True)
  • evolvesFrom: We can create a new binary feature indicating whether the pokemon evolves from another pokemon or not. 0 for no and 1 for yes.
df['evolvesFrom'] = df['evolvesFrom'].notnull().astype(int)
  • evolvesTo: This can be handled the same way as evolvesFrom: we can create a new binary feature indicating whether the pokemon evolves to another pokemon or not. 0 for no and 1 for yes.
df['evolvesTo'] = df['evolvesTo'].apply(lambda x: int(isinstance(x, list) and len(x) > 0))
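To make the effect of handle_unknown='ignore' concrete, here is a small sketch with made-up rarity values. "Promo" is a hypothetical category that the encoder never sees during fitting, and the dense array is obtained with toarray() instead of sparse_output=False purely to keep the example independent of the scikit-learn version.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy rarity values; "Unknown" plays the role of the fillna placeholder.
train = [["Common"], ["Rare"], ["Unknown"], ["Common"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
feature_names = list(encoder.get_feature_names_out(["rarity"]))

# "Promo" was never seen during fit, so its row is encoded as all zeros
# instead of raising an error.
encoded = encoder.transform([["Rare"], ["Promo"]]).toarray()

print(feature_names)
print(encoded)
```

This matters once the encoder is reused on new cards: an unseen rarity simply contributes no active column.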

List-Based Categorical Features

Next we can look at the list-based categorical features. Weaknesses and resistances also carry modifier values, so we first define a function that extracts them. Then we can proceed with the feature extraction.

def extract_modifiers(modifier_list):
  if not isinstance(modifier_list, list):
    return (0, 0)

  total_multiplier = 0
  total_modifier = 0

  for item in modifier_list:
    value_str = item['value'].strip()

    if '×' in value_str:
      numeric_part = value_str.replace('×', '')
      total_multiplier += int(numeric_part)
    elif '+' in value_str or '-' in value_str:
      total_modifier += int(value_str)
          
  return (total_multiplier, total_modifier)
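To check how the function behaves, it helps to call it on a few hand-written inputs. The snippet below restates the function compactly so it runs on its own; the sample weakness and resistance entries are made up for illustration.

```python
def extract_modifiers(modifier_list):
    """Return (sum of multipliers, sum of additive modifiers)."""
    if not isinstance(modifier_list, list):
        return (0, 0)

    total_multiplier, total_modifier = 0, 0
    for item in modifier_list:
        value = item["value"].strip()
        if "×" in value:
            # "×2" -> 2
            total_multiplier += int(value.replace("×", ""))
        elif "+" in value or "-" in value:
            # int() accepts a leading sign: "+20" -> 20, "-30" -> -30
            total_modifier += int(value)
    return (total_multiplier, total_modifier)


only_multiplier = extract_modifiers([{"type": "Fighting", "value": "×2"}])
only_modifier = extract_modifiers([{"type": "Psychic", "value": "-30"}])
mixed = extract_modifiers([
    {"type": "Fire", "value": "+20"},
    {"type": "Water", "value": "×2"},
])
empty = extract_modifiers([])

print(only_multiplier, only_modifier, mixed, empty)
```

Note that both totals are sums, so a card with no weaknesses gets (0, 0) rather than a neutral multiplier of 1.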
  • subtypes: We can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

subtypes_encoded = mlb.fit_transform(df['subtypes'].fillna('None').apply(lambda x: x if isinstance(x, list) else [x]))
subtypes_encoded_df = pd.DataFrame(subtypes_encoded, columns=[f'subtype_{cls}' for cls in mlb.classes_])
df = pd.concat([df, subtypes_encoded_df], axis=1)
df.drop(columns=['subtypes'], inplace=True)
  • types: Similar to subtypes, we can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
types_encoded = mlb.fit_transform(df['types'].fillna('None').apply(lambda x: x if isinstance(x, list) else [x]))
types_encoded_df = pd.DataFrame(types_encoded, columns=[f'type_{cls}' for cls in mlb.classes_])
df = pd.concat([df, types_encoded_df], axis=1)
df.drop(columns=['types'], inplace=True)
  • weaknesses: We can extract three features from this column:
    • weakness_types: We can extract the types from the weaknesses and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
    • total_weakness_multiplier: We can extract the multiplier values from the weaknesses and sum them up to create a new numerical feature (e.g. “×2” -> 2).
    • total_weakness_modifier: We can extract the modifier values from the weaknesses (e.g. “+20” -> 20) and sum them up to create a new numerical feature.
mlb_weakness = MultiLabelBinarizer()

weakness_encoded = mlb_weakness.fit_transform(
    df['weaknesses'].apply(
        lambda x: [w['type'] for w in x] if isinstance(x, list) else []
    )
)
weakness_encoded_df = pd.DataFrame(weakness_encoded, columns=[f'weakness_{cls}' for cls in mlb_weakness.classes_])
df = pd.concat([df, weakness_encoded_df], axis=1)

total_weakness_values = df['weaknesses'].apply(extract_modifiers)
df[['total_weakness_multiplier', 'total_weakness_modifier']] = pd.DataFrame(
  total_weakness_values.tolist(), 
  index=df.index
)
df.drop(columns=['weaknesses'], inplace=True)
  • resistances: Similar to weaknesses, we can extract three features from this column:
    • resistance_types: We can extract the types from the resistances and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
    • total_resistance_modifier: We can extract the modifier values from the resistances (e.g. “-20” -> -20) and sum them up to create a new numerical feature.
    • total_resistance_multiplier: We can extract the multiplier values from the resistances and sum them up to create a new numerical feature.
mlb_resistance = MultiLabelBinarizer()

resistance_encoded = mlb_resistance.fit_transform(
    df['resistances'].apply(
        lambda x: [r['type'] for r in x] if isinstance(x, list) else []
    )
)
resistance_encoded_df = pd.DataFrame(resistance_encoded, columns=[f'resistance_{cls}' for cls in mlb_resistance.classes_])
df = pd.concat([df, resistance_encoded_df], axis=1)

total_resistance_values = df['resistances'].apply(extract_modifiers)
df[['total_resistance_multiplier', 'total_resistance_modifier']] = pd.DataFrame(
    total_resistance_values.tolist(),
    index=df.index
)
df.drop(columns=['resistances'], inplace=True)
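The type extraction plus MultiLabelBinarizer step can be easier to follow on a tiny example. The sketch below uses three made-up weakness lists (illustrative values only) and shows that each distinct type becomes its own 0/1 column, with an empty list mapping to a row of zeros.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three toy cards: one weakness, two weaknesses, and none.
weakness_lists = [
    [{"type": "Psychic", "value": "×2"}],
    [{"type": "Lightning", "value": "×2"}, {"type": "Grass", "value": "+20"}],
    [],
]

# Keep only the type names, as in the lambda used above.
type_lists = [[w["type"] for w in row] for row in weakness_lists]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(type_lists)

# classes_ is sorted, so the column order is alphabetical.
columns = [f"weakness_{cls}" for cls in mlb.classes_]

print(columns)
print(encoded)
```

Unlike one-hot encoding, a single row can have several columns set to 1, which is what list-valued features need.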

High-Cardinality Categorical Features

Let's take a look at the categorical features that have a high number of unique values:

  • pokedex_frequency: We can count the number of times each pokedex number appears in the dataset and use that as a feature.
# Convert lists to tuples (hashable) for frequency counting
pokedex_keys = df['nationalPokedexNumbers'].apply(
    lambda x: tuple(x) if isinstance(x, list) else None
)
df['pokedex_frequency'] = pokedex_keys.map(pokedex_keys.value_counts())
df.drop(columns=['nationalPokedexNumbers'], inplace=True)
  • artist: We can count the number of times each artist appears in the dataset and use that as a feature.
df['artist_frequency'] = df['artist'].map(df['artist'].value_counts())
df.drop(columns=['artist'], inplace=True)
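Frequency encoding replaces each category with how often it occurs. The sketch below shows the idea on a toy artist column (the counts are made up) and also what happens to a missing artist: value_counts skips missing values, so the mapped frequency is missing too and the column ends up as floats.

```python
import pandas as pd

# Toy artist column with one missing value (illustrative data).
toy = pd.DataFrame({
    "artist": ["Mitsuhiro Arita", "5ban Graphics", "Mitsuhiro Arita", None]
})

# Count occurrences per artist, then map each row to its artist's count.
counts = toy["artist"].value_counts()
toy["artist_frequency"] = toy["artist"].map(counts)

print(toy)
```

If missing frequencies are a problem for a downstream model, filling them with 0 is one option, though that is a choice to make deliberately rather than something this snippet decides.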

Complex JSON/Text Features

Finally, we have the more complex features that are in JSON format or text:

  • abilities: I will split this into three features:
    • ability_count: The number of abilities the pokemon has.
    • ability_text: The combined text of all abilities.
    • has_pokemon_power: A binary feature indicating whether the pokemon has a Poké-Body or Poké-Power ability.
df['ability_count'] = df['abilities'].apply(lambda x: len(x) if isinstance(x, list) else 0)
df['ability_text'] = df['abilities'].apply(lambda x: ' '.join([ability['text'] for ability in x]) if isinstance(x, list) else '')
# Poké-Body / Poké-Power is stored in each ability's 'type' field, not its 'name'
df['has_pokemon_power'] = df['abilities'].apply(lambda x: int(any(ability.get('type') in ['Poké-Body', 'Poké-Power'] for ability in x)) if isinstance(x, list) else 0)
df.drop(columns=['abilities'], inplace=True)
  • attacks: Similar to abilities, I will split this into three features:
    • attack_count: The number of attacks the pokemon has.
    • max_damage: The highest explicit damage value among all attacks. Some damage values contain non-numeric characters (e.g., “30+”, “50×”), so we strip those and convert the rest to an integer. If the damage field holds no number, we fall back to the largest number found in the attack text. In the future we can also consider more complex parsing methods to better estimate the maximum damage.
    • attack_cost: The total converted energy cost of all attacks.
df['attack_count'] = df['attacks'].apply(lambda x: len(x) if isinstance(x, list) else 0)

def extract_max_damage(attacks):
    if not isinstance(attacks, list) or len(attacks) == 0:
        return 0
    
    damages = []
    
    for attack in attacks:
        if isinstance(attack.get('damage'), str):
            damage_str = attack['damage'].replace('+', '').replace('-', '').replace('×', '').strip()
            if damage_str.isdigit():
                damages.append(int(damage_str))
                continue
            
        if isinstance(attack.get('text'), str):
            numbers = re.findall(r'\b(\d+)\b', attack['text'])
            if numbers:
                damages.append(max(int(num) for num in numbers))
    
    return max(damages, default=0)

df['max_damage'] = df['attacks'].apply(extract_max_damage)

df['attack_cost'] = df['attacks'].apply(lambda x: sum([len(attack['cost']) for attack in x]) if isinstance(x, list) else 0)
df.drop(columns=['attacks'], inplace=True)
  • rules: This can be converted into a binary feature indicating whether the pokemon has any special rules or not.
df['has_rules'] = df['rules'].apply(lambda x: int(isinstance(x, list) and len(x) > 0))
df.drop(columns=['rules'], inplace=True)
  • ancientTrait: This can also be converted into a binary feature indicating whether the pokemon has an ancient trait or not.
df["has_ancient_trait"] = df['ancientTrait'].apply(lambda x: int(isinstance(x, dict)))
df.drop(columns=['ancientTrait'], inplace=True)
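The max_damage heuristic is worth trying on a few hand-written attacks, because the text fallback can pick up numbers that are not damage values. The snippet below restates the function compactly so it runs on its own; the three sample attacks are invented for illustration.

```python
import re


def extract_max_damage(attacks):
    """Largest damage found across attacks, with a text-based fallback."""
    if not isinstance(attacks, list) or len(attacks) == 0:
        return 0
    damages = []
    for attack in attacks:
        damage = attack.get("damage")
        if isinstance(damage, str):
            cleaned = damage.replace("+", "").replace("-", "").replace("×", "").strip()
            if cleaned.isdigit():
                damages.append(int(cleaned))
                continue
        # Fallback: take the largest standalone number in the attack text.
        text = attack.get("text")
        if isinstance(text, str):
            numbers = re.findall(r"\b(\d+)\b", text)
            if numbers:
                damages.append(max(int(n) for n in numbers))
    return max(damages, default=0)


plain = extract_max_damage([{"damage": "30", "text": ""}])
with_plus = extract_max_damage([{"damage": "20+", "text": "Flip a coin."}])
# No damage value, so the number in the text is used even though it
# counts damage counters rather than damage points.
text_only = extract_max_damage(
    [{"damage": "", "text": "Put 3 damage counters on the Defending Pokémon."}]
)

print(plain, with_plus, text_only)
```

The last case shows the limitation mentioned above: any number in the text is treated as damage, so the feature is best read as a rough estimate.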

What Our New Dataset Looks Like

After performing all the feature engineering steps, let's take a look at the first few rows of our new dataframe to see what it looks like now.

print(df.shape)
df.head(3)
(4470, 111)
level hp evolvesFrom convertedRetreatCost number flavorText evolvesTo level_was_missing primary_pokedex_number pokemon_count ... pokedex_frequency artist_frequency ability_count ability_text has_pokemon_power attack_count max_damage attack_cost has_rules has_ancient_trait
0 42.0 80 1 3 1 Its brain can outperform a supercomputer. Its ... 0 0 65 1 ... 29 471.0 1 As often as you like during your turn (before ... 0 1 30 3 0 0
1 52.0 100 1 3 2 A brutal Pokémon with pressurized water jets o... 0 0 9 1 ... 41 471.0 1 As often as you like during your turn (before ... 0 1 40 3 0 0
2 55.0 120 0 1 3 A rare and elusive Pokémon that is said to bri... 1 0 113 1 ... 25 471.0 0 0 2 80 6 0 0

3 rows × 111 columns

We can see that we have successfully transformed our original dataframe into a more structured format that is suitable for machine learning models. From the 25 original columns, we now have 111 columns (including the hp target) that capture various aspects of the pokemon cards. We save this new dataframe to a CSV file for future use.

df.to_csv('data/processed_pokemon_cards.csv', index=False)